This is a red wine data analysis project. In this project, we analyse which chemical features affect the quality of red wine.
## [1] "fixed.acidity" "volatile.acidity" "citric.acid"
## [4] "residual.sugar" "chlorides" "free.sulfur.dioxide"
## [7] "total.sulfur.dioxide" "density" "pH"
## [10] "sulphates" "alcohol" "quality"
## [1] 1599 12
The data has 1599 rows and 12 columns.
The structure of the data is as follows:
## 'data.frame': 1599 obs. of 12 variables:
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
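As a minimal sketch of the inspection steps above, the snippet below rebuilds a few columns from the first ten rows printed in the structure output and runs the same kind of checks on them. The object name `wine_head` is hypothetical, and only seven of the twelve columns are typed in, so the dimensions and summary values differ from those of the full data set.

```r
# First ten rows of selected columns, copied from the str() output above.
# `wine_head` is a hypothetical name; the full data has 1599 rows and 12 columns.
wine_head <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  pH               = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35),
  sulphates        = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

names(wine_head)    # column names
dim(wine_head)      # number of rows and columns
str(wine_head)      # type and first values of each column
summary(wine_head)  # min, quartiles, median, mean and max per column
```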
The summary statistics of the data are as follows:
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 4.60 Min. :0.1200 Min. :0.000 Min. : 0.900
## 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090 1st Qu.: 1.900
## Median : 7.90 Median :0.5200 Median :0.260 Median : 2.200
## Mean : 8.32 Mean :0.5278 Mean :0.271 Mean : 2.539
## 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420 3rd Qu.: 2.600
## Max. :15.90 Max. :1.5800 Max. :1.000 Max. :15.500
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.01200 Min. : 1.00 Min. : 6.00
## 1st Qu.:0.07000 1st Qu.: 7.00 1st Qu.: 22.00
## Median :0.07900 Median :14.00 Median : 38.00
## Mean :0.08747 Mean :15.87 Mean : 46.47
## 3rd Qu.:0.09000 3rd Qu.:21.00 3rd Qu.: 62.00
## Max. :0.61100 Max. :72.00 Max. :289.00
## density pH sulphates alcohol
## Min. :0.9901 Min. :2.740 Min. :0.3300 Min. : 8.40
## 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500 1st Qu.: 9.50
## Median :0.9968 Median :3.310 Median :0.6200 Median :10.20
## Mean :0.9967 Mean :3.311 Mean :0.6581 Mean :10.42
## 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300 3rd Qu.:11.10
## Max. :1.0037 Max. :4.010 Max. :2.0000 Max. :14.90
## quality
## Min. :3.000
## 1st Qu.:5.000
## Median :6.000
## Mean :5.636
## 3rd Qu.:6.000
## Max. :8.000
We observe that there are 11 characteristics that may affect the quality. We analyse each of them below.

# Univariate Plots Section
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.636 6.000 8.000
From the above plot and summary we see that the mean quality is 5.6 and that the first and third quartiles are 5 and 6, so most of the observations have a quality of 5 or 6.
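One way to check how the quality scores are distributed is to tabulate them. The sketch below does this for the ten quality values visible in the structure output only, so the proportions are illustrative and not those of the full data set; the variable names are hypothetical.

```r
# Quality scores of the first ten wines, copied from the str() output.
quality <- c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)

quality_counts <- table(quality)              # observations per quality level
quality_share  <- prop.table(quality_counts)  # the same counts as proportions

# Share of wines rated 5 or 6 in this small sample.
share_5_or_6 <- mean(quality %in% c(5L, 6L))

print(quality_counts)
print(round(quality_share, 2))
print(share_5_or_6)
```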
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.210 3.310 3.311 3.400 4.010
In the above, we see that the mean and median are almost identical (both about 3.3) and that the spread is narrow: pH ranges from 2.74 to 4.01, with most values between 3 and 4.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.60 7.10 7.90 8.32 9.20 15.90
From above, we see the fixed acidity analysis. Most of the data has a fixed acidity between 7 and 10.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3900 0.5200 0.5278 0.6400 1.5800
From the plot we see that most of the data is concentrated between 0.1 and 1.0. For better visualization, we limit the x-axis to obtain the following figure.
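Limiting the x-axis amounts to drawing the histogram over a chosen range only. The following is a rough base R sketch of that idea, using the ten volatile acidity values visible in the structure output. The original figures may well have been produced differently (for example with ggplot2); the upper limit of 1.0 comes from the text, while the variable names and the bin width are illustrative assumptions.

```r
# Volatile acidity of the first ten wines, copied from the str() output.
volatile_acidity <- c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5)

x_limits <- c(0, 1.0)  # upper limit taken from the text; values above it are left out

# Keep only the values inside the chosen range.
inside <- volatile_acidity[volatile_acidity >= x_limits[1] &
                           volatile_acidity <= x_limits[2]]

# Histogram over the restricted range (the bin width of 0.1 is an arbitrary choice).
hist(inside,
     breaks = seq(x_limits[1], x_limits[2], by = 0.1),
     xlim   = x_limits,
     main   = "Volatile acidity (x-axis limited)",
     xlab   = "volatile.acidity")
```

On the full data set, the same filter would drop the handful of wines above the limit, which is what makes the bulk of the distribution easier to read.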
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.090 0.260 0.271 0.420 1.000
From the summary and graph of citric acid, we observe that most observations fall between 0 and 0.5, which puts the mean and median around 0.27 and 0.26 respectively, while the maximum is still 1.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9901 0.9956 0.9968 0.9967 0.9978 1.0037
In the above density plot we observe that density varies only by very small amounts, with nearly all values between 0.992 and 1.003.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100
From the plot we see that most of the data is concentrated between 0.05 and 0.2. For better visualization, we limit the x-axis to obtain the following figure.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.539 2.600 15.500
From the plot we see that most of the data is concentrated between 1 and 8. For better visualization, we limit the x-axis to obtain the following figure.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 14.00 15.87 21.00 72.00
From the plot we see that most of the data is concentrated between 0 and 50. For better visualization, we limit the x-axis to obtain the following figure. We also observe that the number of observations gradually decreases as the free sulphur dioxide content increases.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 22.00 38.00 46.47 62.00 289.00
From the plot we see that most of the data is concentrated between 0 and 150. For better visualization, we limit the x-axis to obtain the following figure. We also observe that the number of observations gradually decreases as the total sulphur dioxide content increases.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.5500 0.6200 0.6581 0.7300 2.0000
From the plot we see that most of the data is concentrated between 0.25 and 1.1. For better visualization, we limit the x-axis to obtain the following figure. We also observe that most of the observations are between 0.5 and 0.75, with mean and median around 0.66 and 0.62 respectively.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
In the above alcohol plot we observe that alcohol values range from roughly 8 to 15 and are concentrated between 9 and 12.
After performing the initial and univariate analysis, we now answer some important questions that give an idea of the data and help us with further analysis.
The data consists of 1599 observations and 12 characteristics: 11 input features and 1 outcome feature, quality.
Quality is the main feature of interest in the data, together with the other 11 characteristics.
Most of the observations have:
1. A quality of 5 or 6. Very few observations have a quality of 3, 7 or 8.
2. An average pH between 3 and 3.5.
3. An alcohol value between 9 and 11.
4. Very low citric acid, density, chlorides and sulphates values.
No.
Volatile acidity, chlorides, citric acid, residual sugar, total sulphur dioxide and sulphates have very few observations at higher values. The x-axis limits on these plots have been adjusted for better understanding and visualization.
Let us look at the correlation statistics
## fixed.acidity volatile.acidity citric.acid
## fixed.acidity 1.00000000 -0.256130895 0.67170343
## volatile.acidity -0.25613089 1.000000000 -0.55249568
## citric.acid 0.67170343 -0.552495685 1.00000000
## residual.sugar 0.11477672 0.001917882 0.14357716
## chlorides 0.09370519 0.061297772 0.20382291
## free.sulfur.dioxide -0.15379419 -0.010503827 -0.06097813
## total.sulfur.dioxide -0.11318144 0.076470005 0.03553302
## density 0.66804729 0.022026232 0.36494718
## pH -0.68297819 0.234937294 -0.54190414
## sulphates 0.18300566 -0.260986685 0.31277004
## alcohol -0.06166827 -0.202288027 0.10990325
## quality 0.12405165 -0.390557780 0.22637251
## residual.sugar chlorides free.sulfur.dioxide
## fixed.acidity 0.114776724 0.093705186 -0.153794193
## volatile.acidity 0.001917882 0.061297772 -0.010503827
## citric.acid 0.143577162 0.203822914 -0.060978129
## residual.sugar 1.000000000 0.055609535 0.187048995
## chlorides 0.055609535 1.000000000 0.005562147
## free.sulfur.dioxide 0.187048995 0.005562147 1.000000000
## total.sulfur.dioxide 0.203027882 0.047400468 0.667666450
## density 0.355283371 0.200632327 -0.021945831
## pH -0.085652422 -0.265026131 0.070377499
## sulphates 0.005527121 0.371260481 0.051657572
## alcohol 0.042075437 -0.221140545 -0.069408354
## quality 0.013731637 -0.128906560 -0.050656057
## total.sulfur.dioxide density pH
## fixed.acidity -0.11318144 0.66804729 -0.68297819
## volatile.acidity 0.07647000 0.02202623 0.23493729
## citric.acid 0.03553302 0.36494718 -0.54190414
## residual.sugar 0.20302788 0.35528337 -0.08565242
## chlorides 0.04740047 0.20063233 -0.26502613
## free.sulfur.dioxide 0.66766645 -0.02194583 0.07037750
## total.sulfur.dioxide 1.00000000 0.07126948 -0.06649456
## density 0.07126948 1.00000000 -0.34169933
## pH -0.06649456 -0.34169933 1.00000000
## sulphates 0.04294684 0.14850641 -0.19664760
## alcohol -0.20565394 -0.49617977 0.20563251
## quality -0.18510029 -0.17491923 -0.05773139
## sulphates alcohol quality
## fixed.acidity 0.183005664 -0.06166827 0.12405165
## volatile.acidity -0.260986685 -0.20228803 -0.39055778
## citric.acid 0.312770044 0.10990325 0.22637251
## residual.sugar 0.005527121 0.04207544 0.01373164
## chlorides 0.371260481 -0.22114054 -0.12890656
## free.sulfur.dioxide 0.051657572 -0.06940835 -0.05065606
## total.sulfur.dioxide 0.042946836 -0.20565394 -0.18510029
## density 0.148506412 -0.49617977 -0.17491923
## pH -0.196647602 0.20563251 -0.05773139
## sulphates 1.000000000 0.09359475 0.25139708
## alcohol 0.093594750 1.00000000 0.47616632
## quality 0.251397079 0.47616632 1.00000000
Let’s examine the correlation matrix. Correlation values range from -1 to 1: values towards 1 indicate positive correlation and values towards -1 indicate negative correlation.
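To make the meaning of a correlation value concrete, the sketch below writes out the Pearson formula and compares it with R's built-in `cor()`. It uses only the ten alcohol and quality values visible in the structure output, so the result will not match the 0.476 reported for the full data set; the variable names are illustrative.

```r
# Alcohol and quality for the first ten wines, copied from the str() output.
alcohol <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# Pearson correlation written out: sum of cross-products of the centred values,
# divided by the square root of the product of their sums of squares.
pearson_manual <- sum((alcohol - mean(alcohol)) * (quality - mean(quality))) /
  sqrt(sum((alcohol - mean(alcohol))^2) * sum((quality - mean(quality))^2))

# Built-in equivalent (method = "pearson" is the default).
pearson_builtin <- cor(alcohol, quality)

print(c(manual = pearson_manual, builtin = pearson_builtin))
```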
Let us draw the correlation pairs figure and observe the graphs
When we see how quality is correlated with others, we observe the following:
## fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## [1,] 0.1240516 -0.3905578 0.2263725 0.01373164 -0.1289066
## free.sulfur.dioxide total.sulfur.dioxide density pH
## [1,] -0.05065606 -0.1851003 -0.1749192 -0.05773139
## sulphates alcohol quality
## [1,] 0.2513971 0.4761663 1
We see that quality is positively correlated with alcohol, fixed acidity, citric acid, residual sugar and sulphates, and negatively correlated with volatile acidity, chlorides, free sulphur dioxide, total sulphur dioxide, density and pH.
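The split into positively and negatively correlated features can be reproduced directly from the values printed above. A small sketch, with the correlations typed in by hand and hypothetical object names:

```r
# Correlation of each feature with quality, copied from the output above.
quality_cor <- c(
  fixed.acidity        =  0.1240516,
  volatile.acidity     = -0.3905578,
  citric.acid          =  0.2263725,
  residual.sugar       =  0.01373164,
  chlorides            = -0.1289066,
  free.sulfur.dioxide  = -0.05065606,
  total.sulfur.dioxide = -0.1851003,
  density              = -0.1749192,
  pH                   = -0.05773139,
  sulphates            =  0.2513971,
  alcohol              =  0.4761663
)

# Strongest positive first, then strongest negative first.
positive <- sort(quality_cor[quality_cor > 0], decreasing = TRUE)
negative <- sort(quality_cor[quality_cor < 0])

print(round(positive, 2))
print(round(negative, 2))
```

Sorting by size also shows that several of these correlations (residual sugar, free sulphur dioxide, pH) are very close to zero.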
Let us explore other correlations.
We have now observed the correlation and impact of other factors on quality.
We observe that quality has:
1. A positive correlation with alcohol (0.48), sulphates (0.25), citric acid (0.23) and, more weakly, fixed acidity (0.12).
2. Almost no correlation with residual sugar (0.01) and only a weak negative one with density (-0.17).
3. A negative correlation with volatile acidity (-0.39) and, more weakly, chlorides (-0.13).
Density and citric acid increase with fixed acidity while pH decreases. Total sulphur dioxide and free sulphur dioxide are positively and roughly linearly related.
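Strongly related pairs like these can be picked out of a correlation matrix programmatically. The sketch below uses a four-variable excerpt of the matrix shown earlier (values copied by hand); the cutoff of 0.6 and the object names are arbitrary choices made for illustration.

```r
# Excerpt of the correlation matrix above, restricted to four variables.
vars <- c("fixed.acidity", "citric.acid", "density", "pH")
cor_sub <- matrix(
  c( 1.00000000,  0.67170343,  0.66804729, -0.68297819,
     0.67170343,  1.00000000,  0.36494718, -0.54190414,
     0.66804729,  0.36494718,  1.00000000, -0.34169933,
    -0.68297819, -0.54190414, -0.34169933,  1.00000000),
  nrow = 4, byrow = TRUE, dimnames = list(vars, vars)
)

threshold <- 0.6  # arbitrary cutoff for a "strong" correlation

# Use only the upper triangle so that each pair is reported once.
idx <- which(upper.tri(cor_sub) & abs(cor_sub) > threshold, arr.ind = TRUE)

strong_pairs <- data.frame(
  var1 = rownames(cor_sub)[idx[, "row"]],
  var2 = colnames(cor_sub)[idx[, "col"]],
  r    = cor_sub[idx]
)
print(strong_pairs)
```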
Quality increases with citric acid and decreases with volatile acidity.
Quality decreases slightly as free and total sulphur dioxide increase (correlations of about -0.05 and -0.19). Quality decreases as volatile acidity increases and rises slightly with fixed acidity (0.12). Citric acid increases with fixed acidity, and so does quality.
Volatile acidity decreases as citric acid and fixed acidity increase. Most of the high quality observations have density below 20.
We see that high alcohol and low density give good wine quality.

### Plot Two
From above we see that total and free sulphur dioxide increase together, but low quantities of each give good quality.
Some analysis has been done to understand which features impact red wine quality. We see that some features alone do not affect the quality. It is important to do a complete analysis before coming to conclusions. Some of the interesting findings made through our analysis:
1. More than 60% of the data have average quality.
2. Free and total sulphur dioxide increase in parallel, but lower quantities of these are associated with good quality.
3. High alcohol content and low density are associated with higher quality.
I struggled to understand the data set and to work out which plots would give the best visualization. Although I faced difficulty initially, the analysis went easily once I started working. It was surprising how most of the observations have an average quality value and how quality decreases with density. Statistical tests can be done to prove the findings mathematically.